Between Resolution Collapse and Variance Inflation: Weighted Conformal Anomaly Detection in Low-Data Regimes

arXiv.org Machine Learning

Standard conformal anomaly detection provides marginal finite-sample guarantees under the assumption of exchangeability. However, real-world data often exhibit distribution shifts, necessitating a weighted conformal approach to adapt to local non-stationarity. We show that this adaptation induces a critical trade-off between the minimum attainable p-value and its stability. As importance weights localize to relevant calibration instances, the effective sample size decreases. This can render standard conformal p-values overly conservative for effective error control, while the smoothing technique used to mitigate this issue introduces conditional variance, potentially masking anomalies. We propose a continuous inference relaxation that resolves this dilemma by decoupling local adaptation from tail resolution via continuous weighted kernel density estimation. While relaxing finite-sample exactness to asymptotic validity, our method eliminates Monte Carlo variability and recovers the statistical power lost to discretization. Empirical evaluations confirm that our approach not only restores detection capabilities where discrete baselines yield zero discoveries, but also outperforms standard methods in statistical power while maintaining valid marginal error control in practice.
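
The trade-off described in this abstract can be illustrated with a minimal sketch. It assumes the commonly used form of the weighted conformal p-value (weighted mass of calibration scores at least as large as the test score, plus the test point's own weight, divided by the total weight) and Kish's formula for effective sample size. The function names and toy weights are illustrative assumptions, not the paper's implementation, and the paper's kernel-density relaxation is not reproduced here.

```python
def weighted_conformal_p_value(cal_scores, cal_weights, test_score, test_weight):
    """Deterministic weighted conformal p-value (higher score = more anomalous).

    Assumed form: (weighted mass of calibration scores >= test score
    + test weight) / (total calibration weight + test weight).
    """
    total = sum(cal_weights) + test_weight
    mass_ge = sum(w for s, w in zip(cal_scores, cal_weights) if s >= test_score)
    return (mass_ge + test_weight) / total


def kish_effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def min_attainable_p_value(cal_weights, test_weight):
    """Smallest p-value the discrete formula can return: the test score
    exceeds every calibration score, so only the test weight remains."""
    return test_weight / (sum(cal_weights) + test_weight)


# Hypothetical toy setup: 99 calibration points, test weight 1.
uniform = [1.0] * 99                   # no localization
localized = [1.0] * 9 + [0.0] * 90     # weights concentrate on 9 points

floor_uniform = min_attainable_p_value(uniform, 1.0)
floor_localized = min_attainable_p_value(localized, 1.0)
ess_uniform = kish_effective_sample_size(uniform)
ess_localized = kish_effective_sample_size(localized)
```

In this toy setup, localizing the weights reduces the effective sample size from 99 to 9 and raises the smallest attainable p-value from 0.01 to 0.1, so no test point could be rejected at a level below 0.1 regardless of how extreme its score is.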


Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

arXiv.org Artificial Intelligence

Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
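
As a rough illustration of what a character-level phonotactic baseline might look like, the sketch below scores each word by its mean negative log-probability under an add-one-smoothed character-bigram model fitted on the wordlist itself. This is an assumed stand-in, not the algorithm from the paper. The toy strings are invented placeholders rather than real Kokborok data, and the syllable-level features the abstract reports as stronger are not modeled here.

```python
import math
from collections import Counter


def bigrams(word):
    """Character bigrams with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def fit_bigram_counts(words):
    """Count bigrams and their first characters over a wordlist."""
    pair_counts = Counter()
    first_counts = Counter()
    alphabet = {"#"}
    for w in words:
        alphabet.update(w)
        for bg in bigrams(w):
            pair_counts[bg] += 1
            first_counts[bg[0]] += 1
    return pair_counts, first_counts, len(alphabet)


def anomaly_score(word, pair_counts, first_counts, vocab_size):
    """Mean negative log-probability of the word's bigrams.

    Add-one smoothing; the extra +1 in the denominator reserves
    probability mass for characters never seen in the wordlist.
    """
    bgs = bigrams(word)
    total = 0.0
    for bg in bgs:
        p = (pair_counts[bg] + 1) / (first_counts[bg[0]] + vocab_size + 1)
        total -= math.log(p)
    return total / len(bgs)


# Invented placeholder strings (NOT real Kokborok entries).
wordlist = ["kaham", "kami", "mai", "kai", "hamka"]
model = fit_bigram_counts(wordlist)

typical = anomaly_score("kami", *model)   # built from frequent bigrams
unusual = anomaly_score("strx", *model)   # every bigram is unseen
```

Entries whose score is high relative to the rest of the list would be the ones flagged for manual verification. Where to set the cutoff is a separate choice that this sketch leaves open.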


User-Based Sequential Modeling with Transformer Encoders for Insider Threat Detection

arXiv.org Artificial Intelligence

Insider threat detection presents unique challenges due to the authorized status of malicious actors and the subtlety of anomalous behaviors. Existing machine learning methods often treat user activity as isolated events, thereby failing to leverage sequential dependencies in user behavior. In this study, we propose a User-Based Sequencing (UBS) methodology, transforming the CERT insider threat dataset into structured temporal sequences suitable for deep sequential modeling. We deploy a Transformer Encoder architecture to model benign user activity and employ its reconstruction errors as anomaly scores. These scores are subsequently evaluated using three unsupervised outlier detection algorithms: One-Class SVM (OCSVM), Local Outlier Factor (LOF), and Isolation Forest (iForest). Across four rigorously designed test sets, including combinations of multiple CERT dataset releases, our UBS-Transformer pipeline consistently achieves state-of-the-art performance, notably 96.61% accuracy, 99.43% recall, 96.38% F1-score, 95.00% AUROC, and exceptionally low false negative (0.0057) and false positive (0.0571) rates. Comparative analyses demonstrate that our approach substantially outperforms tabular and conventional autoencoder baselines, underscoring the efficacy of sequential user modeling and advanced anomaly detection in the insider threat domain.


TAD-Bench: A Comprehensive Benchmark for Embedding-Based Text Anomaly Detection

arXiv.org Artificial Intelligence

Anomaly detection is a critical task in machine learning, with applications ranging from fraud detection and content moderation to user behavior analysis (Pang et al., 2021). Within natural language processing (NLP), anomaly detection has become increasingly relevant for identifying outliers such as harmful content, phishing attempts, and spam reviews. However, while AD tasks in structured data (e.g., tabular, time series, graphs) (Steinbuss and Böhm, 2021; Blázquez-García et al., 2021; Qiao et al., 2024) have achieved significant maturity, [...] Existing studies often lack systematic evaluations of how different embeddings perform across diverse anomaly types, raising questions about their generalization capabilities in complex, real-world scenarios such as multilingual settings or domain-specific anomalies. Recent efforts, such as AD-NLP (Bejan et al., 2023) and NLP-ADBench (Li et al., 2024), have significantly advanced anomaly detection in NLP. AD-NLP provides valuable insights into different types of anomalies, while NLP-ADBench expands evaluations to a wide range of algorithms and datasets.


Testing for Outliers with Conformal p-values

arXiv.org Machine Learning

This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data.
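
For orientation, the marginal conformal p-value this abstract starts from can be sketched as follows: the p-value of a test point is one plus the number of calibration scores at least as large as its own score, divided by the calibration size plus one. Pairing these p-values with the Benjamini-Hochberg procedure is an assumption made here for illustration, since the abstract does not name the multiple-testing procedure. The conditionally valid p-values the paper introduces are not reproduced, and all names and numbers below are hypothetical.

```python
from bisect import bisect_left


def conformal_p_values(calibration_scores, test_scores):
    """Marginal conformal p-values (higher score = more anomalous).

    p = (1 + #{calibration scores >= test score}) / (n + 1)
    """
    cal = sorted(calibration_scores)
    n = len(cal)
    p_values = []
    for s in test_scores:
        n_ge = n - bisect_left(cal, s)  # calibration scores >= s
        p_values.append((1 + n_ge) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, alpha=0.1):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # largest rank passing its threshold
    return sorted(order[:k])


# Hypothetical scores: 19 calibration points, 3 test points.
calibration = list(range(1, 20))   # scores 1..19 from inlier data
tests = [0, 25, 10]                # below all, above all, mid-range

p_vals = conformal_p_values(calibration, tests)
flagged = benjamini_hochberg(p_vals, alpha=0.2)
```

With 19 calibration scores, the smallest possible p-value is 1/20, which is why only the test point above every calibration score can be flagged in this toy setup.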